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This paper develops an analytically tractable model for determining the equilibrium distribution 
of mismatch repair deficient strains in unicellular populations. The approach is based on the sin- 
gle fitness peak (SFP) model, which has been used in Eigen's quasispecies equations in order to 
understand various aspects of evolutionary dynamics. As with the quasispecies model, our model 
for mutator-nonmutator equilibrium undergoes a phase transition in the limit of infinite sequence 
length. This "repair catastrophe" occurs at a critical repair error probability of e r = L V i a /L, where 
L V ia denotes the length of the genome controlling viability, while L denotes the overall length of 
the genome. The repair catastrophe therefore occurs when the repair error probability exceeds the 
fraction of deleterious mutations. Our model also gives a quantitative estimate for the equilibrium 
fraction of mutators in Escherichia coli. 
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In order to preserve the integrity of their genomes, liv- 
ing systems have evolved sophisticated mechanisms for 
correcting errors in their DNA sequences [lj . Otherwise, 
genetic damage due to environmental factors such as ra- 
diation, metabolic free radicals, and mutagens, combined 
with replication errors, would lead to unviable organisms 
due to the unrecoverable loss of genetic information. This 
phenomenon, which was first characterized by Eigen in 
Q, is known as the "error catastrophe" @>Q. It has since 
been studied in a number of theoretical papers 00 0] 
(and references therein), and has also been observed ex- 
perimentally 0- 

Some of the error-correcting ability in living systems 
is already built into the DNA polymerases themselves. 
In Escherichia coli, the proofreading ability of the DNA 
polymerases Pol I and Pol II results in an error probabil- 
ity of 10~ 6 — 10~ 7 per base pair |l|. Additional enzymes 
continuously scan the DNA molecule, repairing lesions 
and mismatches that occur due to environmental dam- 
age. 

A key error-repair mechanism is known as mismatch re- 
pair, and occurs immediately following DNA replication. 
The mismatch repair system scans the DNA molecule, 
identifies, and then corrects mismatched base pairs. Mis- 
match repair in E. coli reduces the error probability in 
DNA replication to 10~ 8 - 10~ 10 per base pair Cells 
with inactivated mismatch repair consequently have mu- 
tation rates which are 10 to 10, 000 times higher than 
cells whose mismatch repair system is functioning. Be- 
cause of their higher than wild-type mutation rates, these 
"mutator" strains are believed to play an important role 
in the emergence of antibiotic resistance, and cancer in 
multicellular organisms H ES [H [H Q . 

To develop a model for the equilibrium distribution 
of mutators versus nonmutators in a unicellular popula- 
tion, we consider a genome of alphabet size S ("bases" 
0,1,...,S — 1), consisting of two genes. The first gene 
consists of L v i a bases, and controls the viability of the 



genome. The second gene consists of L rep bases, and 
codes for the enzymatic machinery involved in mismatch 
repair. If we let a denote an arbitrary gene sequence, 
then we may write, a = o- V i a o- rev . 

We assume a single fitness peak (SFP) model for 
both genes. Thus, there is a unique, "fit" sequence 
Co = o~ v ia.oo~ rep fi. A cell with genome a has a first-order 
growth rate constant k » 1 if a V i a — o~ V i a ^, and 1 oth- 
erwise. Mismatch repair has an error probability of e r 
per mismatched base pair, and is functioning only when 

0~rep — C*rep,0* 

While somewhat artificial, the SFP model has been 
successfully applied in 14] toward understanding the 
correlations between antibody and viral mutation rates. 
Furthermore, because proteins generally have a key set 
of conserved residues which more or less dictate their fi- 
nal structure and function, the corresponding gene has 
a subsequence of conserved bases required for its proper 
function (lLll5j. Thus, by summing over the unconserved 
bases, it is possible to reduce the fitness landscape to an 
SFP in the conserved subsequence. Therefore, there is 
reason to believe that many of the phenomenological as- 
pects of our system can be captured by an SFP-based 
approach, and that such an approach can also be semi- 
quantitative in a number of cases. 

The basic equation governing the dynamics on the 
genome space has the form of Eigen's quasispecies equa- 
tions |2|. L|. 

(K a - K(i))xg. + [K m (o-' , a)x a > - K m (a, a')x a ] 

(1) 

where x a denotes the fraction of the population with 
genome a, K a is the growth rate constant of a, K m (o~, a') 
denotes the mutation rate constant from er to a' , and 

= E<r ^<yX a {t). 

We assume that replication errors are sufficiently small 
so that we need only worry about point mutations, and 
that since k » 1, any flow off of the viability peak is 



dt 
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unidirectional. Furthermore, since we are only interested 
in the relative distribution of viable mutators and non- 
mutators, we only focus on a subspace of sequences given 
by g — Gviafi^rep- Thus, from now on, we shall sim- 
plify matters and redenote a rep as a. The full genome 
is cr V i a fl(j by implication. On this subspace, the effec- 
tive growth rate constant becomes k(l — L V i a e a ), due to 
leakage off of the fitness peak. Here e a denotes the per 
base pair replication error probability, and is equal to ee r 
if a = (To, and e otherwise. Finally, by the symmetry of 
our system, we may make the further assumption that x a 
depends only on the Hamming distance HD((T 7 a ) from 
(To- Thus, defining 0;((t) = {a'\HD(<r' , a) = I}, we may 
then also define xi = x ai where a G ili(aa). Note that 



point mutations between xi and some x a may only occur 
if x (j G ili.i±i((Jo)- However, we may negect intra- fi;(cro) 
couplings, due to cancellation of mutational inflows and 
outflows. 

A a G £li(ao) may be connected via a point mutation 
to a o' G £li-i(<Jo) by changing any one of the / bases 
distinct from the corresponding bases in <tq back to the 
corresponding base in ctq. Thus, there are I possible con- 
nections. Act G ^z(cro) may be connected via a point 
mutation to a a' G f2/+i((7o) by changing any one of the 
L rep — I bases equal to the corresponding bases in ctq. 
Since there are 5 — 1 possibilities per base, the result is 
(L rep — l)(S — 1) connections. The net mutational flow 
is then, 



J 



kl 



^2 Kn(o"',cr)av - K m (a 7 a')x a ] = _ ^ (ei- 1 xi- 1 - em) + k(L rep - l)(ei +1 x t+1 - em) (2) 



r 



where eo = ee r , and e; = e for I > 1. We divide the e's 
by S — 1 because a point mutation can occur to any one 
of the S — 1 bases distinct from the changed base. 
We also have, n{t) = k(l — L via e) + kL via e(l — e r )xo- 



Now, define C; = ( Lr ; cp )(5 f — 1)', the number of elements 
in 0;(<To), and set zi — Cm. If we reexpress the dy- 
namical equations in terms of Zi, then at equilibrium we 
obtain the system of equations, 
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Except for the last equation, the (/ + l) s * equation 
has a mutational contribution from zi+\ which scales as 
1 1 L rep . This means that the contribution to zi due to 
backmutation from z;+i becomes negligible for large se- 
quence lengths. This makes sense, since for finite I, the 
ratio Ci'/Ci — > oo as L 



rep 



oo for V > I, so the prob- 
ability of mutating to lower values of I vanishes in the 
limit of infinite sequence length. 

The term, (h v % a l L rep )(l — e r )zo(l — zq), and the cor- 
responding terms in the other equations arise from the 



R{t) term in the original quasispecics equations (Eq. (1)) 
of our model. Because the nonmutator sequence has a 
lower level of leakage off of the viability peak than the 
mutator sequences, it has a higher effective growth rate 
constant. Thus, when dealing with population fractions 
as opposed to absolute populations, the result is an ef- 
fective positive flow into the nonmutator sequence given 
by the term (L via / L rep )(l — e r )z (l — z ). By a similar 
argument, it can be seen why the corresponding terms in 
the remaining equations are negative. 
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We wish to solve these equations for a fixed value of 
ol = L V i a /L rep in the limit of infinite sequence length L. 
Let us focus first on the behavior of z in this limit. In 
the first equilibrium equation, the Z\ term drops out as 
L rep oo, giving, 



= z (a(l - e r )(l - z ) - e r ) 



(4) 



which has the solutions z = 0, 1 — e r /(a(l — e,.)). The 
first solution is inconsistent with the requirement that 
z = 1 when e r = 0. However, the second solution only 
holds as long as z G [0,1]. Clearly, z < 1 V e,.. The 
other requirement that zq > gives, 



e r < 



1 



L 



(5) 



Defining e r ,crit = j^ji we see that in the limit of infi- 
nite genome length, our system has two "phases." For 
e r < e r .crit the system is in a "non-mutator," or, equiv- 
alently, "repairer" phase, in which the fraction of non- 
mutators is a quantity which depends only on a and e r . 
At e r = e r , C rit the system undergoes a "phase" transi- 
tion, which we term the "repair catastrophe," after which 
the system is in a "mutator" ( "non- repairer" ) phase. In 
this phase, there is essentially no preference for being a 
non-mutator, and the fraction of non-mutators becomes 
inversely proportional to the total number of gene se- 
quences. 

A key parameter to study in the phase behavior of our 
model is the localization length, given by, 



(I) 



L re p 



(6) 



i=i 



This quantity measures the mean Hamming distance 
of the population from the nonmutator sequence. To 
compute (I) below the phase transition in the limit of 
Lrep — ► oo, we may note that for finite I our equilibrium 
equations become, 

= a(l - e r )z (l - z ) - e r z 
= — a(l — e r )zoZi + e r zo — z\ 



-a(l - e r )z zi + zi-i - zi 



(7) 



We have already solved the first equation. The next two 
equations can be solved together to give, for I > 1, 



zi = e r (l + a(l - e r )z ) l z 



(8) 



It should be noted that, while each zi converges to the 
corresponding formula given above as L rep — > oo, the 
convergence is not uniform, since the larger the I, the 



larger L rep must be made to get zi within some specified 



cutoff of its L rep — oo value. 



Define zj ;OC = \imi Jrep ^ oc z\. It may be readily checked 
that Ya= -o z (,oo — 1, so total population is conserved in 
this limiting process. The localization length is given by, 



<*> = !>,= 



1 ^r.crit 



€r,crit €r,crit 



(9) 



Note that, as expected, the localization length is finite 



for 



< e 



r,crit 5 



but diverges at the phase transition. 



It is also useful to solve the equilibrium distribution ex- 



actly for the case a = 0. This corresponds to L rep = L, 
that is, the entire genome consists of the repair gene. 
Note that e r ,crit = 0, so that the system is always in the 
mutator phase. In this case, it may be shown that the 
equilibrium solution is given by, z\ = ( Lr ; cp )(5 f — l) l e r zo 

for I > 0, and so the requirement that Yld=o z i — 1 gi ves > 
zq = 1/(1 + e r (S Lr ' p — 1)). It is readily shown that for 
large L rep , the localization length (l) — ► (1 — l/S)L rep . 
This result is equivalent to the case of a uniform distri- 
bution, which makes sense since for a = there is no 
preference for being a nonmutator in the limit of large 

The phase behavior which emerges from this model 
may be understood as follows: For highly efficient repair, 
the selective advantage for being a nonmutator is suffi- 
ciently large to cause the population to equilibrate in a 
localized cluster about the nonmutator sequence. When 
repair is inefficient, the accumulation of deleterious mu- 
tations in both mutators and nonmutators is comparable, 
and hence the mutators, which are entropically strongly 
favored, dominate the population. The selective advan- 
tage for being a nonmutator is dictated by a, since for 
low a there is relatively little leakage off of the fitness 
peak, while for high a there is a large amount of leakage 
off of the fitness peak. Thus, for low a, repair has to be 
highly efficient to give the nonmutators a sufficient selec- 
tive advantage to be in the nonmutator phase, while for 
high a, nonmutators have a significant advantage even 
for relatively inefficient repair. 

One of the main features to note regarding the 
mutator-nonmutator equilibrium is that it is independent 
of the background error probability e. This feature is in- 
teresting because as e — > 0, the difference in viability 
between the mutators and the nonmutators disappears. 
Thus, one might naively expect e r , C rit to be a function of 
e, but in the limit of small e (so that only point mutations 
are important), this is not the case. 

Figure 1 shows a plot of z versus e r for a = 1/9, 1, 
and 9. We used S = 2, and took a value of L rep = 
1,000 in order to sufficiently converge the calculations. 
The equilibrium equations were solved by using fixed- 
point iteration at every e r . Note that the phase transition 
does indeed occur at the predicted values of e r ,crit- The 
analytical, L rep — oo curves lie essentially on top of our 
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FIG. 1: Plots of zq versus e r for a = 1/9, 1, and 9. 
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FIG. 2: Plots of (I) versus e r for a = 1/9, 1, and 9. 



numerical results, and were therefore not plotted here. 

Figure 2 shows the corresponding plots of (l) versus 
e r . Note that the localization lengths settle at the value 
of L rep /2 at the transition, which is the expected finite 
L rep behavior for the mutator phase. 

Finally, we may use our model to estimate the equilib- 
rium fraction of mismatch repair deficient strains in E. 
coli. The E. coli genome has 4, 639, 221 base pairs, com- 
prising 4, 403 genes 0] . Based on calculations for Sac- 
charomyces cerivisiae, or Baker's yeast, we esimate that 
between 18 — 30% of these genes are "viability" genes, 
i.e, required for E. coli survival |l7l Hj| (unfortunately, 
similarly detailed data is not currently available for E. 
coli, so we had to make an estimate based on available 
information). Thus, we assume approximately 1,000 "vi- 
ability" genes, which we gather into the viability peak of 
our model. The mismatch repair system is controlled 
by the MutH, MutL, MutS, and UvrD (or MutU) pro- 
teins, giving 4 repair genes. If we simply use the average 
gene length, and assume that the same fraction of base 
pairs must be conserved in both the viability genes and 
in the mismatch repair genes, then in our model we ob- 
tain a w 1000/4 = 250. Since mismatch repair has a 
failure probability of 10~ 4 — 10 _1 per mismatched base 



pair, we estimate an equilibrium fraction of mutators in 
the range of 4 x 10~ 7 — 4 x 10~ 4 . For E. coli, the ob- 
served equilibrium fraction of mutators is on the order 
of 10~ 5 — 10 -3 9]. While encouraging, our result is nev- 
ertheless based on a number of simplifying assumptions. 
The strongest evidence in support of our model would be 
the experimental observation of the repair catastrophe 
itself. While it is not clear how to selectively control the 
efficiency of the mismatch repair system, if possible this 
would allow a direct experimental test of our model. 

As a concluding remark, we should note that our pre- 
diction of a repair catastrophe in mutator-nonmutator 
equilibrium suggests that phase transitions may underlie 
the behavior of a variety of biological systems. A clas- 
sification of the phase behaviors inherent in various bio- 
logical networks will greatly increase our understanding 
of the underlying dynamics governing such systems. 
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